Skip to content

feat(binary): add git binary patch support#61

Open
weihanglo wants to merge 12 commits intobmwill:masterfrom
weihanglo:binary-patch
Open

feat(binary): add git binary patch support#61
weihanglo wants to merge 12 commits intobmwill:masterfrom
weihanglo:binary-patch

Conversation

@weihanglo
Copy link
Copy Markdown
Contributor

@weihanglo weihanglo commented Apr 13, 2026

Blocked on #59. Please review from 58f95d7

Basically a port of weihanglo#33.

BTW, this is probably the very last major feature PR. I might have some other small ones for housekeeping and preparation of the new release.


Add parsing and application of git binary patches,
including both literal and delta formats.

This feature is behind the binary Cargo feature.

The implementation is based on

Some others highlights:

  • Bump MSRV to 1.75.0 as zlib-rs requiring that.
  • Parsing of binary format is always available, only decode/apply operations is behind the binary feature.

@weihanglo weihanglo force-pushed the binary-patch branch 5 times, most recently from 61c9c2b to 531e8d9 Compare April 14, 2026 05:00
@weihanglo weihanglo marked this pull request as draft April 15, 2026 22:57
* Parse `diff --git` extended headers
* split multi-file git diffs at `diff --git` boundaries
Compat test for also `git apply`.
Unlike unidiff,
gitdiff produces patches for empty file creations/deletions
(`0\t0` in numstat)
because they carry `diff --git` + extended headers even without hunks.

Binary files (`-\t-\t`) are skipped in gitdiff mode for now.
@weihanglo weihanglo force-pushed the binary-patch branch 3 times, most recently from 7101ee1 to a2190d2 Compare April 17, 2026 16:08
* Added types representing both literal and delta Git binary patches
* Added a parser for the `GIT binary patch` format.

This doesn't include the patch application
(which will be added in later commits)

The implementation is based on

* Specification from <https://diffx.org/spec/binary-diffs.html>
* Behavior observation of Git CLI
Returns the longest valid UTF-8 prefix of the input.
For `str` this is the entire input; for `[u8]` it truncates
at the first invalid byte.

This will be used at call sites where generic `T: Text` input
needs to be narrowed to `&str` for parsers that only handle
ASCII data (e.g. binary patch base85 content).
The API was stabilized in 1.73.
The lint was added in 1.93.

This is required for a MSRV bump to 1.75
This is a preparation for binary diff application support.

* Git binary patch is compressed by zlib hence flate2
* zlib-rs (which is the most performant zlib backend)
  requires MSRV 1.75.0+ hence the bump.
* Add base85 encoder/decoder and Git delta format decoder.
* Wire them into `BinaryPatch::apply() and `apply_reverse()`
  for decoding zlib-compressed, base85-encoded binary payload.

These are feature-gated behind the `binary` feature.
Now both tests require `binary` Cargo feature.
Comment thread src/binary/base85.rs
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use our hand-written one. Alternatives:

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hand-written one here is solid, and it is nice to avoid an extra dep if we can

@weihanglo weihanglo marked this pull request as ready for review April 17, 2026 16:39
Comment thread src/binary/delta.rs
Comment on lines +49 to +54
} else {
// ADD instruction
let add_len = control as usize;
let data = cursor.read_bytes(add_len)?;
result.extend_from_slice(data);
}
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its been a while since i've implemented the delta compression algorithm but I think that this may be mis handling the case where control == 0x00 which is reserved by by git but here we're treating it as an ADD of 0 bytes.

Although rereading the diffx spec you linked at the top of this file 0x00 doesn't seem to be called out as special or reserved so we could probably go either way.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! This is definitely worth a compat test. I dare not reading GPL source to confirm that 😨.

Comment thread src/binary/delta.rs
});
}

let mut result = Vec::with_capacity(header_mod_size as usize);
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be possibly dangerous as we're reading a u64 value from a possibly malformed set of delta instructions which could lead to some inputs causing an instant OOM. We probably want to try and cap the amount we're willing to allocate up front in order to avoid small adversarial inputs from crashing the process.

Comment thread src/binary/mod.rs
Comment on lines +110 to +122
fn decode_data(binary_data: &BinaryData<'_>) -> Result<Vec<u8>, BinaryPatchParseError> {
use std::io::Read;

let compressed = decode_base85_lines(binary_data.data)?;

let mut decoder = flate2::read::ZlibDecoder::new(&compressed[..]);
let mut decompressed = Vec::new();
decoder
.read_to_end(&mut decompressed)
.map_err(|e| BinaryPatchParseErrorKind::DecompressionFailed(e.to_string()))?;

Ok(decompressed)
}
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't seem to do any checks that the decompressed data matches the size that is present in BinaryData.size. Do we want to do an extra check for that to ensure the decompressed data is the expected size?

@bmwill
Copy link
Copy Markdown
Owner

bmwill commented Apr 18, 2026

BTW, this is probably the very last major feature PR. I might have some other small ones for housekeeping and preparation of the new release.

Awesome! hopefully getting all this in wasn't too painful.

I have a few smaller things that I think would be nice to add or polish myself so hopefully i can find the time to do that before we are ready to do a release, although I won't hold up a release if i don't get around to those things.

@weihanglo
Copy link
Copy Markdown
Contributor Author

I have a few smaller things that I think would be nice to add or polish myself so hopefully i can find the time to do that before we are ready to do a release, although I won't hold up a release if i don't get around to those things.

Feel free to dump your thoughts in an issue. I can help clean them up!

Here is a rough list in my mind:

  • Error handling
    • big parse error enum maybe
    • precise byte positions or spans
    • user have ways to distinguish between different error kinds, at least some ways to distinguish most possible wanted ones.
  • Public API. We need to review what we expose and if it is in a good shape before release
  • Performance benchmark?
  • Of course a changelog updates (and probably some changelog infra via cargo-release, release-plz, cargo-smart-release, or just simple a CI release job triggered by tag)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants